Multi-Step Structure Image Inpainting Model with Attention Mechanism

The proliferation of deep learning has made image inpainting an important research field. Although current image inpainting models have achieved remarkable results, two-stage methods tend to produce structural errors because the coarse inpainting stage receives insufficient treatment. To address this problem, we propose a multi-step structure image inpainting model that incorporates an attention mechanism. Unlike previous two-stage inpainting models, we divide the damaged area into four sub-areas, compute a priority for each sub-area, determine the inpainting order from these priorities, and complete the coarse inpainting stage in several steps. The multi-step method enhances the stability of the model. The structural attention mechanism strengthens the expression of structural features and improves the quality of structure and contour reconstruction. Experimental evaluation on benchmark datasets shows that our method effectively reduces structural errors and improves inpainting quality.


Introduction
Image inpainting fills the damaged area of an image to produce a complete image that is consistent with human visual perception and cognition. With the development of deep learning, image inpainting has become an important research field. It has been widely used in art restoration, special effects production for film and video, image editing, and other fields. Current image inpainting methods fall into two groups: traditional mathematical methods and deep learning methods based on convolutional neural networks. Traditional methods mostly use non-learning mathematical computation to extract information from the intact part of the image and fill in the damaged area. Such methods cannot extract the deep information of the image, so the synthesized image lacks semantic information and the restoration looks abrupt and deviates from human visual perception. Traditional methods fail especially often on images with large holes, because the larger the hole, the more complex the texture, structure, and semantic information that must be recovered.
More recently, convolutional neural networks (CNNs) have alleviated the above problems. Convolution extracts the deep information of the image more effectively, and the generated image information is richer. Goodfellow et al. proposed generative adversarial networks (GANs) [1] in 2014. Since then, fields related to image generation [2][3][4] have developed significantly. The adversarial training mode encourages the network to generate more realistic images. A GAN consists mainly of a generator and a discriminator, both of which are convolutional neural networks. The generator produces the repaired image; the restored image and the original image are then fed to the discriminator, and the discriminator's judgment is fed back to the generator. During this adversarial training, the images produced by the generator gradually gain richer color and texture detail, while the discriminator gradually improves its ability to distinguish synthetic images from natural ones. To sum up, GANs are a highly promising approach to deep image inpainting.
Presently, GAN-based deep image inpainting methods fall into two categories, single-stage networks and two-stage networks, which differ mainly in the generator. A single-stage network is primarily an end-to-end model in which the generator directly outputs the restored image. The Shift-Net proposed by Yan et al. [5] adopts a single-stage network model: it combines a shift connection layer with U-Net to fill missing regions of any shape with sharp structures and fine-detailed textures. This approach still fails to resolve the blurriness of the single-stage network model, resulting in the loss of texture details in the generated image. In subsequent work, Liu et al. [6] proposed the MED model to balance texture and structural features, with the aim of preserving both in the inpainting results.
Most of these single-stage deep image inpainting methods handle image details poorly, producing overly smooth, blurred images that lack texture and structural information. Two-stage generator networks adopt additional measures to enhance image details. Such a network works like a painter: it first generates the structure information or a sketch of the image and then generates the refined image in the second stage. Yu et al. [7] proposed a Contextual Attention model that roughly restores the entire image based on the context in the initial stage and generates more refined results in the subsequent stage. Nazeri et al. [8] introduced the EdgeConnect model, which restores the image's outline in the first stage and then completes the colorization in the second stage.
Two-stage generators can often express more vivid textures and semantic information. However, the coarse-to-fine network depends on the restoration result of the first stage. If the first stage produces a coarse image with deviations, the fine filling in the second stage also deviates from human visual perception, with obvious structural and semantic errors.
To improve the reliability of the coarse-to-fine image inpainting algorithm and to address the lack of image structure processing in the coarse-to-fine network model, we propose a coarse-to-fine deep image inpainting model based on multi-step structure inpainting. Our proposed network focuses on the coarse stage of the restoration, as this is the basis for the final image restoration. The coarse-to-fine generator adopts the U-Net structure and uses skip connections to counter the vanishing gradient problem and improve the transmission of image information in the deep network. In the first stage, we input the contour information of the image into the first-stage network to reconstruct the structural features of the image. Unlike previous deep inpainting methods, we do not rebuild the damaged area all at once but inpaint it in steps: we divide the occluded area into four sub-areas and then restore them gradually. Attention mechanisms are now widely used in various fields [9][10][11][12]. They can better extract the important content of the image and improve the feature expression ability of the network. Therefore, we introduce a structural attention module to improve the extraction of structural information. In this attention module, the local binary pattern (LBP) operator and grayscale images enhance structural features, and gated convolution is used to extract features. The enhanced feature information is then transmitted to the decoder. We add local and global discriminators to the coarse-to-fine network to improve both the details and the overall quality of the inpainted image. In particular, in the second stage, we additionally use perceptual loss and style loss to enhance authenticity. Experimental results demonstrate that our method effectively enhances the structural reconstruction of the two-stage inpainting model and reduces the occurrence of incorrect structures.
The main contributions of this paper are as follows: (1) We propose a coarse-to-fine inpainting network in which we inpaint the structural information of the image in the first stage and color the reconstructed area in the second stage. To make the damaged area easier to recover, we also put forward a multi-step structure inpainting scheme. (2) We introduce a structural information attention module to improve the reconstruction of structural information in the first stage. (3) Our proposed method outperforms existing methods on the benchmark datasets. It alleviates the instability of two-stage models, such as the gated convolution model [13] and the EdgeConnect model [8], in image structure reconstruction, resulting in improved restoration. Our method also demonstrates better inpainting results than MED [6], a single-stage restoration model.

Related Work
Advances in image inpainting technology have made it possible to restore and edit images with greater precision and accuracy, resulting in more visually appealing and useful images. This has led to a growing demand for image inpainting solutions in a variety of industries, including photography, printing, and multimedia. In addition, image inpainting is becoming an important branch in the field of privacy protection [14][15][16][17][18]. We briefly review the work related to this paper.

Traditional Mathematical Method
Before deep learning matured, traditional mathematical methods used non-learning techniques to fill in images. These methods are mainly based on diffusion filling and patch matching.
The diffusion-based method propagates information from the boundary of the intact area into the damaged area to complete the filling. Bertalmio et al. [19] first proposed a diffusion method based on a partial differential equation, inspired by art restoration. Later methods based on partial differential equation diffusion include the total variation model [20], Euler's elastica model [21], and the Mumford-Shah model [22]. Diffusion-based methods restore small missing areas such as scratches well, but they cannot complete images with large damaged areas.
The patch-matching method searches the intact area of the image for patches with which to fill the hole. It can fill some large damaged areas more naturally than diffusion-based techniques. However, the filled image lacks semantic information, so the result often does not conform to human vision and cognition. Drori et al. [23] iteratively approximated the content of the missing area by smooth interpolation and filled the most similar image patches into the occluded area after adding detail transformations. Criminisi et al. [24] used a priority to compute the restoration order and globally matched similar regions for filling, thus preferentially preserving the propagation direction of the structural texture. The Criminisi algorithm can restore some images with simple texture structure but still cannot deal with complex content. Wilczkowiak et al. [25] perform guided fixes interactively with the user by searching for similar patch areas. Although these approaches outperform diffusion-based approaches in occlusion inpainting, they still fail to fill in more complex textures and structures and, in particular, lack semantic information.

Image Inpainting Based On Deep Learning
After Goodfellow et al. [1] proposed generative adversarial networks (GANs), Pathak et al. [26] introduced GANs into image inpainting for the first time with the Context Encoders model, whose generator adopts an encoder-decoder structure. The Context Encoders model uses a convolutional neural network, which can inpaint the semantic information of the image and large occluded areas. However, the fully connected layer between its encoder and decoder makes the output overly smooth and blurry. To enhance texture and structural details, Iizuka et al. [27] used a fully convolutional generator and added a local discriminator to form a global and local image inpainting model. However, this double-discriminator model still handles image details poorly. Yang et al. [28] proposed a joint generation structure to constrain the image content and texture. Yan et al. [5] constructed a single-stage generator network based on U-Net and introduced a shift connection layer to fill missing areas of any shape with sharp structure and delicate texture. However, the single-stage U-Net structure still produces overly smooth reconstruction results. Yu et al. [7] constructed a coarse-to-fine generator architecture, proposed the contextual attention mechanism to further ensure consistency between the inpainted area and the intact area, and adopted the improved Wasserstein GAN [29] to stabilize training. Liu et al. [30] proposed partial convolution to improve the inpainting of irregular occlusions. Yu et al. [13] improved on partial convolution with a learnable gated convolution and allowed users to edit images interactively. Inspired by artistic creation, Nazeri et al. [8] adopted a coarse-to-fine network structure that restores the outline information of the image in the first stage and completes the fine filling of the image in the second stage.
However, because the image contour is reconstructed all at once in the first stage, the model is prone to producing wrong outlines, which eventually causes the second stage to fail. Liu et al. [31] designed a coherent semantic attention mechanism to strengthen the relationship between features in the missing region. Liu et al. [6] argue that coarse-to-fine generators often cause serious semantic errors, so they put forward a joint encoder-decoder model and introduced a feature equalization operation. However, the joint encoder-decoder model still cannot completely solve the image blurring produced by the single-stage model. More recently, Zhu et al. [32] proposed mask-aware dynamic filtering (MADF) and used a cascade thinning network to fill in any missing image area. Wu et al. [33] used a stacked network to repair the occluded image from coarse to fine, improving the reusability of the extracted features.
In deep inpainting models, the disadvantages of single-stage models [34][35][36][37][38][39], such as the lack of texture detail and the blurred structure of the generated image, still cannot be effectively solved. Most coarse-to-fine models [40][41][42][43][44][45][46] can produce more realistic textures, so many researchers are committed to improving the coarse-to-fine image inpainting model. However, these models lack detail in the coarse repair stage. Moreover, most coarse-to-fine networks restore all damaged areas at once, so if the first stage produces an inaccurate image, the error is amplified in the subsequent refinement network and ultimately yields an unreasonable image. These problems make image inpainting unstable, sometimes producing compelling images and sometimes severe semantic errors.

Model
To solve these problems in the coarse-to-fine model, we propose a two-stage image inpainting model that focuses on restoring structure and outline in the first stage, because these are the cornerstone of texture and color filling in the fine inpainting stage. We propose a multi-step structure inpainting model to improve the first-stage restoration and reduce the difficulty of structural reconstruction. Specifically, in the coarse inpainting stage, we divide the occluded area of the image into four regions, determine the inpainting order by priority calculation, and then feed the regions to the first-stage network one at a time to complete the inpainting gradually. The aim is to restore the easiest part in each step, so that when the remaining parts are rebuilt, the occluded area can sample more surrounding information to complete the filling. GANs are known to be difficult to train, prone to mode collapse, and hard to converge. Filling the occluded area in multiple steps reduces the difficulty of the inpainting and the burden on the generator, thus improving the inpainting result. In addition, we propose a structural attention mechanism to enhance the reliability of structural reconstruction. The attention module takes the grayscale and LBP images of the damaged image as input, fills the holes with six consecutive layers of gated convolution, and then sends the feature information to the decoder at each scale to enhance the features. Then, the contour image reconstructed in the first stage and the original image are input into the fine-stage network, and the color filling of the image is completed through a U-Net to obtain the final restored image.
The proposed network structure is shown in Figure 1. For the convenience of explaining the network structure, this figure omits the gradual inpainting stage and only shows the initial input and final inpainting results.
As shown in Figure 1, let $C_{in} \in \mathbb{R}^{H \times W \times 1}$ be the outline of the damaged image, $M_{in} \in \mathbb{R}^{H \times W \times 1}$ the original mask, $L_{in} \in \mathbb{R}^{H \times W \times 1}$ the LBP feature map of the damaged image, and $G_{in} \in \mathbb{R}^{H \times W \times 1}$ the grayscale version of the damaged image. In the backbone network of the coarse inpainting stage, we use the contour image $C_{in}$ and the mask $M_{in}$ of the damaged image for reconstruction. The structural attention module takes the LBP feature map $L_{in}$, the grayscale map $G_{in}$, and the mask $M_{in}$ as input to extract feature information. The backbone network in the coarse inpainting stage adopts the U-Net structure. The first convolution layer in the encoder downsamples the image $C_{in}$ to a feature map of size $\frac{H}{2} \times \frac{W}{2} \times 16$, the second layer downsamples it to $\frac{H}{4} \times \frac{W}{4} \times 32$, and the subsequent three convolution layers gradually downsample it to $\frac{H}{64} \times \frac{W}{64} \times 512$. The decoder is symmetrical to the encoder, which helps preserve feature information during decoding. Skip connections are introduced to counter the vanishing gradients caused by the deep network structure. At the connection between the encoder and decoder, we do not adopt the traditional fully connected operation because it causes image blur. Instead, we use a residual block composed of three layers of residual convolutions to connect the encoder and the decoder. Compared with a fully connected layer, a convolutional layer can better express the image features, and the residual structure further mitigates the vanishing gradient problem of the deep network.
In the proposed structural attention module, we use the LBP feature map. The LBP operator has good feature extraction ability and is insensitive to illumination changes, which alleviates the difficulty of extracting feature information from unevenly illuminated images. The LBP operator is defined as
$$\mathrm{LBP}(x_c, y_c) = \sum_{p=0}^{P-1} 2^{p}\, s(i_p - i_c),$$
where $(x_c, y_c)$ is the position of the center pixel, $\mathrm{LBP}(x_c, y_c)$ denotes the feature extracted at that position, $p$ indexes the $P$ neighborhood pixels, $c$ denotes the center pixel, $i_p$ and $i_c$ are the corresponding gray values, and $s$ is the sign function
$$s(x) = \begin{cases} 1, & x \ge 0, \\ 0, & x < 0. \end{cases}$$
The attention module uses six layers of gated convolution. A gated convolution applies two different convolution filters, denoted $W$ and $V$, to the same input to obtain a feature response and a gating response, and combines them element-wise:
$$O = \phi(\mathrm{Feature}) \odot \sigma(\mathrm{Gating}),$$
where $\phi$ is the ELU activation function applied to the feature response and $\sigma$ is the sigmoid activation function, which maps the gating values to the range between 0 and 1.
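As an illustration of the LBP operator described above, the following minimal sketch computes the basic 8-neighbour LBP code for each interior pixel of a small grayscale image. The neighbour ordering, the 3×3 neighbourhood, and the function name `lbp_image` are assumptions made for this example and are not taken from the paper.

```python
def lbp_image(gray):
    """Return the 8-neighbour LBP code of every interior pixel of a 2-D list.

    Border pixels are left at 0 because they lack a full neighbourhood.
    """
    # Clockwise neighbour offsets starting at the top-left pixel (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(gray), len(gray[0])
    codes = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            center = gray[y][x]
            code = 0
            for p, (dy, dx) in enumerate(offsets):
                # s(i_p - i_c) is 1 when the neighbour is at least as bright as the centre.
                if gray[y + dy][x + dx] >= center:
                    code += 1 << p
            codes[y][x] = code
    return codes


patch = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
# Adding a constant brightness offset leaves every comparison unchanged.
shifted = [[value + 100 for value in row] for row in patch]
```

Because each code depends only on comparisons with the centre pixel, `lbp_image(shifted)` equals `lbp_image(patch)`, which is the illumination robustness the paragraph refers to.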
The attention module also adopts a six-layer convolutional network, so that the dimension of the feature map becomes $\frac{H}{8} \times \frac{W}{8} \times 64$, which is convenient for upsampling and combining the structural feature information with the decoder.
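The gating operation used in the attention module can be sketched in a few lines. The example below is a single-channel, unpadded toy version in pure Python; the helper names, the kernel sizes, and the choice of which kernel feeds the feature branch are assumptions for illustration only, not the paper's implementation.

```python
import math


def conv2d_valid(image, kernel):
    """Plain single-channel 2-D cross-correlation without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def elu(value, alpha=1.0):
    """ELU activation applied to the feature branch."""
    return value if value > 0 else alpha * (math.exp(value) - 1.0)


def sigmoid(value):
    """Sigmoid activation that maps gating values into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def gated_conv2d(image, feature_kernel, gating_kernel):
    """Output = ELU(feature response) * sigmoid(gating response), element-wise."""
    feature = conv2d_valid(image, feature_kernel)
    gating = conv2d_valid(image, gating_kernel)
    return [
        [elu(f) * sigmoid(g) for f, g in zip(f_row, g_row)]
        for f_row, g_row in zip(feature, gating)
    ]


ones = [[1.0] * 3 for _ in range(3)]
half_open = gated_conv2d(ones, [[0.5, 0.5], [0.5, 0.5]], [[0.0, 0.0], [0.0, 0.0]])
closed = gated_conv2d(ones, [[0.5, 0.5], [0.5, 0.5]], [[-10.0, -10.0], [-10.0, -10.0]])
```

A gating response of zero passes half of the feature value, while a strongly negative gating response suppresses it almost entirely, which is how the learned gate can down-weight invalid (masked) locations.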
In the fine stage, let $I_{in} \in \mathbb{R}^{H \times W \times 3}$ be the damaged color image and $C_{out} \in \mathbb{R}^{H \times W \times 1}$ be the outline image output by the coarse stage. $I_{in}$, $C_{out}$, and $M_{in}$ are input into the fine-stage network to complete the final image coloring. The generator model used in the fine inpainting stage is consistent with that of the coarse inpainting stage.

Multi-Step Structure Inpainting
The specific process of the coarse inpainting stage is shown in Figure 2. We divide the mask into four parts and determine the inpainting order according to the priority policy. The proposed priority calculation method is shown in Equation (1). We determine the priority order according to the size of the complete area contained in each part. Let $pri_i$, $i \in \{1, 2, 3, 4\}$, denote the priority of the $i$-th mask block, $\sum_j D_{i,j}$ the total number of damaged pixels in the $i$-th mask area, and $\sum_k C_{i,k}$ the total number of complete pixels in the $i$-th mask area. The higher the $pri$ value, the higher the priority. We inpaint high-priority blocks first because their masked areas are smaller and easier to restore. Once the high-priority areas have been reconstructed, additional reference information is available when restoring the low-priority areas, leading to better inpainting results and fewer structural errors.
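The ordering rule can be illustrated with a short sketch. Since the exact form of Equation (1) is not reproduced here, the code assumes that the priority of a block is the fraction of complete pixels it contains, and that the four blocks are the quadrants of the mask; both choices, and all function names, are hypothetical. The mask convention (1 for complete, 0 for damaged) follows the one used for the reconstruction loss.

```python
def split_into_quadrants(mask):
    """Split a 2-D mask (1 = complete, 0 = damaged) into four blocks (assumed layout)."""
    mid_y, mid_x = len(mask) // 2, len(mask[0]) // 2
    row_slices = {"top": slice(0, mid_y), "bottom": slice(mid_y, None)}
    col_slices = {"left": slice(0, mid_x), "right": slice(mid_x, None)}
    return {
        f"{r_name}_{c_name}": [row[c] for row in mask[r]]
        for r_name, r in row_slices.items()
        for c_name, c in col_slices.items()
    }


def priority(block):
    """Assumed priority: share of complete pixels among all pixels of the block."""
    damaged = sum(1 for row in block for v in row if v == 0)
    complete = sum(1 for row in block for v in row if v == 1)
    return complete / (damaged + complete)


def inpainting_order(mask):
    """Return block names sorted from highest to lowest priority."""
    blocks = split_into_quadrants(mask)
    return sorted(blocks, key=lambda name: priority(blocks[name]), reverse=True)


mask = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
order = inpainting_order(mask)
```

For this toy mask the top-left block has the most complete pixels and would be restored first, while the fully damaged bottom-left block would be restored last, after its neighbours have been filled.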
We show the first two steps of the multi-step structure inpainting model in Figure 2. In the first step, let $M_1 \in \mathbb{R}^{H \times W \times 1}$ denote the first block of the occluded area to be restored. We input $C_{in}$ and $M_1$ into the backbone network of the coarse stage, and input $L_{in}$, $G_{in}$, and $M_1$ into the structural attention module, which yields the contour image $C_1$ reconstructed in the first step. In the second step, we input $C_1$ and $M_2$ into the backbone network of the coarse stage, and input $L_{in}$, $G_{in}$, and $M_2$ into the structural attention module. After four restoration steps, the final contour restoration image $C_{out}$ is obtained. Finally, the image inpainting is completed by the fine-stage network. The restoration process of an image is shown in Figure 3. Based on the priority calculation, the lower right corner area has the highest priority because its occluded area is the smallest, so it is restored first. The inpainting result (b) is then passed to the coarse inpainting model for reference during the second step. Although the last area has the lowest priority and the largest occluded area, the reference information from the previous three steps reduces the inpainting difficulty, improves generator stability, and enhances the structural reconstruction.
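The step-by-step procedure can be summarised as a loop in which each step receives the contour produced by the previous one. In the sketch below, `toy_coarse_network` is a hypothetical stand-in for the coarse generator (it simply fills the requested pixels with the mean of all currently known values); unknown pixels are marked with `None`, and a step mask uses 0 for the pixels to be restored in that step. None of these names or conventions come from the paper.

```python
def multi_step_coarse_inpainting(contour, step_masks, coarse_network):
    """Run the coarse network once per step mask, feeding each result into the next step."""
    current = contour
    for mask in step_masks:
        current = coarse_network(current, mask)
    return current


def toy_coarse_network(contour, mask):
    """Hypothetical generator stand-in: fill pixels where mask == 0 with the mean known value."""
    known = [v for row in contour for v in row if v is not None]
    fill = sum(known) / len(known)
    return [
        [fill if m == 0 else v for v, m in zip(c_row, m_row)]
        for c_row, m_row in zip(contour, mask)
    ]


damaged_contour = [
    [1.0, None],
    [None, 6.0],
]
# Step masks listed in priority order; each marks one block to restore.
step_masks = [
    [[1, 0], [1, 1]],
    [[1, 1], [0, 1]],
]
restored = multi_step_coarse_inpainting(damaged_contour, step_masks, toy_coarse_network)
```

The point of the loop is that the second call sees three known pixels instead of two, mirroring how later steps in the model can draw on the regions restored earlier.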

Loss Function
According to the different functions of the coarse inpainting network and fine inpainting network, we use two joint loss functions to complete the training of the network.
The coarse-stage network does not contain color, style, and texture features, so we only use feature-matching loss, pixel-level reconstruction loss, and adversarial loss to constrain the structural reconstruction of the image. In the fine-stage network, we use pixel-level reconstruction loss, adversarial loss, perceptual loss, and style loss to constrain jointly.

Feature-Matching Loss
Similar to EdgeConnect [8], we use a feature-matching loss to guarantee the reconstruction quality of the structure image. This loss compares the intermediate activations of the discriminator for the real and the reconstructed structure images.
In this loss, $L$ is the final convolution layer of the discriminator, $N_i$ is the number of elements in the $i$-th activation layer, and $D_1^{(i)}$ is the activation in the $i$-th layer of the discriminator.
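For reference, the feature-matching loss as defined in EdgeConnect [8] can be written as below. This is a reconstruction under the assumption that the present model follows that definition unchanged; $C_{gt}$, the ground-truth contour image, is a symbol introduced here for illustration, and $C_{out}$ denotes the coarse-stage output.

```latex
\mathcal{L}_{fm} = \mathbb{E}\left[ \sum_{i=1}^{L} \frac{1}{N_i}
  \left\| D_1^{(i)}(C_{gt}) - D_1^{(i)}(C_{out}) \right\|_1 \right]
```

Each term measures, layer by layer, how far the discriminator's response to the reconstructed contour is from its response to the real one, normalised by the size of the layer.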

Reconstruction Loss
To improve the refinement at the pixel level of the image, we use L1 distance as the reconstruction loss to measure the error between the predicted image and the actual image.
$I_{out}$ represents the restored image and $I_{gt}$ is the real image. $M_{in} \in \{0, 1\}$ is the input mask, where 0 represents the masked area and 1 represents the complete area, and $\odot$ represents element-wise multiplication.
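A masked L1 error of this kind can be sketched as follows. Because the exact weighting between the masked and complete regions is not given here, the example simply reports the mean absolute error of the two regions separately; the function name and the toy values are assumptions.

```python
def masked_l1(output, target, mask):
    """Mean absolute error inside the hole (mask == 0) and in the complete area (mask == 1)."""
    hole_errors, valid_errors = [], []
    for o_row, t_row, m_row in zip(output, target, mask):
        for o, t, m in zip(o_row, t_row, m_row):
            (valid_errors if m == 1 else hole_errors).append(abs(o - t))
    hole = sum(hole_errors) / len(hole_errors) if hole_errors else 0.0
    valid = sum(valid_errors) / len(valid_errors) if valid_errors else 0.0
    return hole, valid


restored_image = [[0.5, 0.2], [0.9, 1.0]]
real_image = [[0.5, 0.6], [0.5, 1.0]]
input_mask = [[1, 0], [0, 1]]
hole_error, valid_error = masked_l1(restored_image, real_image, input_mask)
```

Separating the two regions makes it easy to weight the hole more heavily than the intact area if desired, which is a common design choice in inpainting losses.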

Adversarial Loss
According to the characteristics of network training, we utilize relativistic average LS adversarial loss [6] to stabilize the training of GAN.
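The relativistic average discriminator $D_{ra}(\cdot)$ used in this loss is defined as
$$D_{ra}(I_{gt}, I_{out}) = \mathrm{sigmoid}\left(C(I_{gt}) - \mathbb{E}_{I_{out}}\left[C(I_{out})\right]\right), \tag{9}$$
where $C(\cdot)$ indicates the discriminator without the last sigmoid function.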
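Equation (9) can be evaluated directly on batches of raw discriminator scores. The sketch below does this for plain Python lists; the function name and the sample scores are illustrative assumptions, and the full adversarial loss built on top of $D_{ra}$ is not reproduced.

```python
import math


def relativistic_average(critic_a, critic_b):
    """D_ra(a, b) = sigmoid(C(a) - mean(C(b))) for every sample in batch a."""
    mean_b = sum(critic_b) / len(critic_b)
    return [1.0 / (1.0 + math.exp(-(score - mean_b))) for score in critic_a]


real_scores = [2.0, 0.0]   # C(I_gt) for two real images (made-up values)
fake_scores = [1.0, 1.0]   # C(I_out) for two restored images (made-up values)
d_real_vs_fake = relativistic_average(real_scores, fake_scores)
```

A value above 0.5 means the sample is judged more realistic than the average sample of the other batch, so the discriminator's output is always relative rather than absolute.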

Perceptual Loss
Human beings not only perceive the color and texture information conveyed by the image surface but also understand the deep semantic information in the image. Inspired by this, we adopt the perceptual loss [47] to guide the generator toward images that are more in line with human perception.
Here $\Phi_i$ is the activation map of the $i$-th selected layer of the VGG-16 network pre-trained on ImageNet; the layers corresponding to $\Phi_i$ in this paper are relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1.

Style Loss
In addition to the perceptual loss, we adopt the style loss [47] to keep the style of the restored image consistent with that of the original image.
Here $G_j^{\Phi}$ is the Gram matrix of size $C_j \times C_j$ constructed from the activation map $\Phi_j$, and the activation maps come from the same layers used in the perceptual loss above.
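The Gram matrix at the heart of the style loss can be computed as in the sketch below. The activation map is represented as a list of channels, each flattened to a list of values; the normalisation by the number of channels times the number of spatial positions is a common convention and is an assumption here, as is the function name.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C channels of flattened activations.

    Entry (i, j) is the inner product of channels i and j, divided by C * H * W
    (an assumed normalisation).
    """
    channels = len(features)
    size = len(features[0])
    norm = channels * size
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j])) / norm
            for j in range(channels)
        ]
        for i in range(channels)
    ]


activation = [
    [1.0, 0.0, 1.0, 0.0],  # channel 0, flattened 2x2 map
    [0.0, 2.0, 0.0, 2.0],  # channel 1, flattened 2x2 map
]
gram = gram_matrix(activation)
```

The result has one row and one column per channel, matching the $C_j \times C_j$ size stated above, and it discards spatial layout, which is why it captures style rather than content.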

Joint Loss
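In the coarse-stage network, the joint loss function is the weighted sum of the feature-matching, reconstruction, and adversarial losses:
$$\mathcal{L}_{coarse} = \lambda_{fm}\mathcal{L}_{fm} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{adv}\mathcal{L}_{adv}.$$
In the fine-stage network, the joint loss function is the weighted sum of the reconstruction, adversarial, perceptual, and style losses:
$$\mathcal{L}_{fine} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{perc}\mathcal{L}_{perc} + \lambda_{style}\mathcal{L}_{style},$$
where $\lambda_{fm}$, $\lambda_{rec}$, $\lambda_{adv}$, $\lambda_{perc}$, and $\lambda_{style}$ are hyperparameters. Following the experience in [6], we set $\lambda_{fm} = 1$, $\lambda_{rec} = 1$, $\lambda_{adv} = 0.1$, $\lambda_{perc} = 0.1$, and $\lambda_{style} = 250$.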
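With the weights stated above, the two joint losses reduce to simple weighted sums. The sketch below assumes the individual loss terms have already been computed as scalars; the function names and the example values are hypothetical.

```python
# Loss weights as stated in the text.
LAMBDA_FM = 1.0
LAMBDA_REC = 1.0
LAMBDA_ADV = 0.1
LAMBDA_PERC = 0.1
LAMBDA_STYLE = 250.0


def coarse_joint_loss(fm, rec, adv):
    """Coarse stage: feature-matching, reconstruction, and adversarial terms."""
    return LAMBDA_FM * fm + LAMBDA_REC * rec + LAMBDA_ADV * adv


def fine_joint_loss(rec, adv, perc, style):
    """Fine stage: reconstruction, adversarial, perceptual, and style terms."""
    return LAMBDA_REC * rec + LAMBDA_ADV * adv + LAMBDA_PERC * perc + LAMBDA_STYLE * style


coarse_total = coarse_joint_loss(fm=2.0, rec=1.0, adv=0.5)
fine_total = fine_joint_loss(rec=1.0, adv=0.5, perc=2.0, style=0.01)
```

The large style weight reflects the fact that raw Gram-matrix differences are typically very small in magnitude, so they need a large multiplier to influence training.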

Comparison of Loss Function
To visually demonstrate the impact of each loss function on image inpainting outcomes, we conducted a comparative experiment for each loss function. As can be seen from Figure 4, the absence of feature-matching loss leads to a deviation in the color and structural features of the object. The removal of reconstruction loss results in a failure of image inpainting. Adversarial loss is crucial for preserving the texture details of the image. Perceptual loss and style loss further enhance the visual effect.

Experiments
We validate the proposed method on two benchmark datasets: Places2 [48] and CelebA [49]. The damaged images were generated using the mask dataset [30], which contains 12,000 randomly generated mask images. In the experiments, we used four mask-ratio ranges for comparison: 10∼20%, 20∼30%, 30∼40%, and 40∼50%. We compare the proposed method with three strong models: GC [13], EC [8], and MED [6]. We randomly sampled several categories of images from Places2 for the comparative experiments, with a training-to-test ratio of 9:1. Figures 5 and 6 show the visual comparison of the proposed method and the three models on the two benchmark datasets. The comparison on the Places2 dataset shows that GC [13] and EC [8], which also use a two-stage generator, often produce incorrect semantics and overly abrupt structures. MED [6], which uses a single-stage generator, shows problems such as fuzzy texture and inconsistent color. In Figure 5, GC [13], EC [8], and MED [6] fill a large damaged area all at once, which places excessive pressure on the generator and eventually leads to unstable restoration results and frequent semantic errors. Moreover, the restored facial features lack delicate structure. Our proposed method effectively alleviates these problems. The multi-step structure inpainting model reduces the pressure on the generator. The details of the reconstructed image are finer, the texture and color are more realistic, and the additional structural attention module fills in structural information better, which greatly improves the reliability of the coarse-stage contour reconstruction network. In Figure 5, the first and last rows demonstrate that our method is able to fill in more intricate textures. The second and third rows show that our method effectively reconstructs the structural characteristics of the building without structural errors.
In the first and second rows of Figure 6, our method restores more delicate facial features. The third and fourth rows illustrate that our method generates mouth details more realistically. These experiments show that our multi-step inpainting model effectively improves the stability of the generator and helps it generate images with rich details. The attention mechanism based on the LBP operator extracts the structural information of the image better, strengthens the generator's ability to express structural features, and further improves the result of the coarse inpainting stage.

Numerical Evaluations
In the numerical evaluation, we used PSNR, SSIM [50], and FID to compare our method with the other three methods. We divided the mask proportion into four groups: 10∼20%, 20∼30%, 30∼40%, and 40∼50%. Table 1 shows the comparison results on Places2 and Table 2 shows the results on CelebA. Higher PSNR and SSIM values indicate better inpainting results, and a lower FID value indicates higher similarity to the original images.
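Of the three metrics, PSNR is simple enough to illustrate directly. The sketch below uses the standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and assumes 8-bit single-channel images stored as nested lists; it is not the evaluation code used for Tables 1 and 2.

```python
import math


def psnr(original, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    squared_errors = [
        (a - b) ** 2
        for o_row, r_row in zip(original, restored)
        for a, b in zip(o_row, r_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


original_image = [[0, 255], [255, 0]]
restored_image = [[0, 255], [255, 255]]  # one of four pixels is completely wrong
score = psnr(original_image, restored_image)
```

Since the mean squared error sits in the denominator, smaller pixel-wise errors give a higher PSNR, which is why higher values indicate better restoration.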
The results in Tables 1 and 2 demonstrate the superior performance of our model compared to the other methods on both the Places2 and CelebA datasets. The high PSNR and SSIM values indicate that our proposed method generates high-quality images with structural features that are closer to reality. The low FID value indicates that the multi-step inpainting model has effectively stabilized the generator and produced inpainted images that are more consistent with the original images. In summary, the multi-step inpainting model and the LBP operator-based attention mechanism effectively enhance the inpainting results in the coarse inpainting stage and reduce structural errors in the final inpainted image.

Ablation Study
In this section, we will continue to validate the proposed method through ablation experiments. We randomly selected a subset of images from the Places2 data set for use in the ablation study. We conducted ablation studies on the multi-step structure inpainting model and the structural attention mechanism to prove the effectiveness of the proposed method.

Multi-step Structure Inpainting Model
To better prove the effectiveness of the method, we control the consistency of other structures and parameters of the model so that whether to enable a multi-step inpainting model is the only variable, and finally verify the proposed multi-step inpainting model on the same mask. As shown in Figure 7, when the entire damaged area is reconstructed at one time, it is difficult to ensure the inpainting effect of the image structure. The resulting image is unstable, and the structure has fractures and semantic errors. The visual effect is improved when we use the multi-step inpainting model, and the generated image conforms to human visual cognition.

Structural Attention Mechanism
In the ablation study of the structural attention mechanism, we control the consistency of other model structures and parameters. The comparison of whether the structural attention mechanism is adopted is shown in Figure 8. We can see from the figure that the structural attention mechanism enhances the structural inpainting of the image and improves the extraction ability of structural features, thus improving the contour reconstruction effect.

Conclusions
This paper presents a multi-step structure inpainting model to address the instability of the first stage of coarse-to-fine image inpainting models. The multi-step structure inpainting model reduces the difficulty of image generation and improves image contour reconstruction in the coarse inpainting stage. In addition, we introduced a structural attention mechanism to extract richer structural information and enhance the expression of image structure. The experimental results on the benchmark datasets show that our proposed method is effective. In future work, we plan to improve the fine inpainting stage so that it fills in finer textures and more vivid colors.